Learn about our project

Back-End

To make our post-classifying model easy to use, we put it behind an online web application that gives users an interface to the algorithm. You can type a potential post and get suggestions for the 5 most relevant subreddits, or you can save the post for later. You can also delete old posts after you have saved them.

-We built the front end with React: components, state management, an account system, and storing and deleting Reddit posts.

-We built the back end with knex, Heroku, Postgres, JavaScript, and Node modules: database tables, API endpoints, and user authentication.
Users can create an account and log in. They can send a post to our external API, which determines the best subreddits for that post. A user can then archive the post, which saves it on our back end so that they can access and change it any time they log in.
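
The save, edit, and delete flow described above can be sketched as follows. This is a minimal illustration, not our actual back end: the `PostArchive` class, its method names, and the post fields are hypothetical, and an in-memory `Map` stands in for the Postgres tables that the real app reaches through knex.

```javascript
// Hypothetical in-memory stand-in for the saved-posts table.
// Assumption: the real app stores these rows in Postgres via knex.
class PostArchive {
  constructor() {
    this.posts = new Map(); // id -> { id, userId, ...post fields }
    this.nextId = 1;
  }

  // Archive a post for a user and return the stored record.
  save(userId, post) {
    const saved = { ...post, id: this.nextId++, userId };
    this.posts.set(saved.id, saved);
    return saved;
  }

  // Return every post the given user has archived.
  listFor(userId) {
    return [...this.posts.values()].filter((p) => p.userId === userId);
  }

  // Change an archived post; only its owner may do so.
  update(userId, id, changes) {
    const existing = this.posts.get(id);
    if (!existing || existing.userId !== userId) return null;
    const updated = { ...existing, ...changes, id, userId };
    this.posts.set(id, updated);
    return updated;
  }

  // Delete an archived post; only its owner may do so.
  remove(userId, id) {
    const existing = this.posts.get(id);
    if (!existing || existing.userId !== userId) return false;
    return this.posts.delete(id);
  }
}

// Example usage with made-up data.
const archive = new PostArchive();
const draft = archive.save(1, { body: "My first draft", suggestions: ["tifu", "Jokes"] });
archive.update(1, draft.id, { body: "My edited draft" });
console.log(archive.listFor(1));
archive.remove(1, draft.id);
```

Checking the owner on every update and delete mirrors the idea that a logged-in user should only see and change their own archived posts.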

Algorithm

The first step for any data science problem is the data itself. When we started scraping Reddit posts, we searched about 100 of the most popular subreddits, plus a few we handpicked ourselves, and kept the ones that actually had text posts. It wouldn’t make sense for us to tell you to post your story in r/gifs, would it? Once we were done we had over 31 thousand posts to work with, across 50 different subreddits.

Now that we had all these posts, we approached the problem as a document similarity one. To put it simply, our model takes a new post and decides which other posts it is most similar to. This works with any kind of classification model, so we tried many different kinds to see which would work best. The clear winner by far was a Random Forest model; our final version achieved 63% accuracy. And remember, if you randomly picked among 50 different categories you would only be correct 2% of the time! And when you consider the 3 or 5 most likely subreddits, like our app does, the chance that the right one is among them is even higher.
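
To make the "3 or 5 most likely subreddits" idea concrete, here is a small sketch of picking the top k labels from a set of class probabilities and measuring how often the true subreddit is among them. The function names `topK` and `topKAccuracy` are hypothetical, the probabilities are made up for illustration, and none of this is our actual model code.

```javascript
// Hypothetical helper: return the k labels with the highest probability.
function topK(probabilities, k) {
  return Object.entries(probabilities)
    .sort(([, a], [, b]) => b - a) // highest probability first
    .slice(0, k)
    .map(([label]) => label);
}

// Fraction of examples whose true label appears among the top k predictions.
function topKAccuracy(examples, k) {
  const hits = examples.filter(({ label, probabilities }) =>
    topK(probabilities, k).includes(label)
  ).length;
  return hits / examples.length;
}

// Made-up probabilities for three posts; not real model output.
const examples = [
  { label: "Jokes", probabilities: { Jokes: 0.5, dadjokes: 0.3, tifu: 0.2 } },
  { label: "tifu", probabilities: { Jokes: 0.4, dadjokes: 0.35, tifu: 0.25 } },
  { label: "dadjokes", probabilities: { Jokes: 0.45, dadjokes: 0.4, tifu: 0.15 } },
];

console.log(topKAccuracy(examples, 1)); // 1 of the 3 posts is a hit
console.log(topKAccuracy(examples, 2)); // 2 of the 3 posts are hits
console.log(1 / 50); // random-guess baseline for 50 subreddits
```

Widening k can only add hits, never remove them, which is why accuracy over the top 3 or 5 suggestions is at least as high as accuracy for the single best guess.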

Of course, this was all done as a one-week project. Given more time, there are plenty of things we could do to improve this product, such as gathering more data, cleaning it more extensively, and optimizing the model.

Subreddits included:

buildapc
Jokes
personalfinance
nosleep
dadjokes
tifu
history
relationships
TwoXChromosomes
IAmA
askscience
Fitness
leagueoflegends
MachineLearning
LifeProTips
travel
webdev
nba
books
PS4
atheism
explainlikeimfive
pcmasterrace
Overwatch
movies
pokemon
gameofthrones
television
malefashionadvice
Music
trees
Android
gaming
lifehacks
WritingPrompts
Games
DIY
Showerthoughts
space
Futurology
Tinder
soccer
politics
listentothis
philosophy
GetMotivated
europe
gadgets
technology
tattoos